ABSTRACT

In this paper, we describe IBM MASTOR, a speech-to-speech
translation system that can translate spontaneous free-form
speech in real time on both laptops and hand-held PDAs. Challenges
include speech recognition and machine translation in adverse
environments, the lack of training data and linguistic resources
for under-studied languages, and the need to rapidly develop
capabilities for new languages. Another challenge is designing
algorithms and building models in a scalable manner so that they
perform well even on hand-held computers with limited memory and
CPU power. We describe our approaches, experience, and success in
building working free-form S2S systems that handle two language
pairs (including a low-resource language).


1.  INTRODUCTION 

Automatic speech-to-speech (S2S) translation breaks down
communication barriers between people who do not share a common
language and hence enables instant oral cross-lingual
communication for many critical applications such as emergency
medical care. The development of an accurate, efficient and robust
S2S translation system poses many challenges. This is especially
true for colloquial speech and resource-deficient languages.

The IBM MASTOR speech-to-speech translation system has been
developed for the DARPA CAST and Transtac programs, whose mission
is to develop technologies that enable rapid deployment of
real-time S2S translation of low-resource languages on portable
devices. It originated from the IBM MARS S2S system for the air
travel reservation domain described in [1], which was significantly
improved in all components, including ASR, MT and TTS, and later
evolved into the MASTOR multilingual S2S system that covers much
broader domains such as medical treatment and force protection
[2,3]. More recently, we have further broadened our experience and
efforts to very rapidly develop systems for under-studied
languages, such as regional dialects of Arabic. The intent of this
program is to provide language support to military, medical and
humanitarian personnel during operations in foreign territories,
by deciphering possibly critical language communications with a
two-way real-time speech-to-speech translation system designed for
specific tasks such as medical triage and force protection.

The initial data collection effort for the project has shown that
the domain of force protection and medical triage is, though
limited, rather broad. In fact, domain coverage is hard to define
where the speech of responding foreign-language speakers is
concerned, as their responses are less constrained and may include
out-of-domain words and concepts. Moreover, a flexible casual or
colloquial speaking style inevitably appears in human-to-human
conversational communication. Therefore, the project is a great
challenge that calls for major research efforts.

Among all the challenges for speech recognition and translation
for under-studied languages, there are two main issues: 1) Lack of
an appropriate amount of speech data that represents the domain of
interest and the oral language spoken by the target speakers,
which makes it difficult to accurately estimate statistical models
for speech recognition and translation. 2) Lack of linguistic
knowledge in the form of spelling standards, transcriptions,
lexicons and dictionaries, or annotated corpora. Therefore, a
variety of approaches have to be explored.

Another critical challenge is to embed complicated algorithms and
programs into small devices for mobile users. A hand-held
computing device may have a 256 MHz CPU and 64 MB of memory;
fitting the programs, as well as the models and data files, into
this memory and operating the system in real time are tremendous
challenges [4].

In this paper, we will describe the overall framework of the
MASTOR system and our approaches for each major component, i.e.,
speech recognition and translation. Various statistical approaches
[5,6,7,8] are explored and used to solve different technical
challenges. We will show how we addressed the challenges that
arise when building automatic speech recognition (ASR) and machine
translation (MT) for colloquial Arabic on both the laptop and
handheld PDA platforms.


2.  SYSTEM  OVERVIEW 

The general framework of our speech translation system is
illustrated in Figure 1. The MASTOR system cascades ASR, MT and
TTS components. The cascaded approach allows us to deploy the
power of the existing advanced speech and language processing
techniques, while concentrating on the unique problems in
speech-to-speech translation. Figure 2 illustrates the MASTOR GUI
(Graphical User Interface) on laptop and PDA, respectively.


Figure  1  IBM  MASTOR  Speech-to-Speech  Translation  System 

The baseline acoustic models for English and Mandarin were
developed for large-vocabulary continuous speech and trained on
over 200 hours of speech collected from about 2000 speakers for
each language. The Arabic dialect speech recognizer, however, was
trained using only about 50 hours of dialectal speech. The training
data for Arabic consists of about 200K short utterances.
Considerable effort was invested in the initial cleaning and
normalization of the training data because of the large number of
irregular dialectal words and variations in spelling. We
experimented with three approaches for pronunciation and acoustic
modeling: grapheme, phonetic, and context-sensitive grapheme, as
described in Section 3.A. We found that using context-sensitive
pronunciation rules reduces the WER of the grapheme-based acoustic
model from 36.7% to 35.8% (0.9% absolute, roughly 2.5% relative).
Based on these results, we decided to use context-sensitive
grapheme models in our system.

The Arabic language model (LM) is an interpolated model consisting
of a trigram LM, a class-based LM and a morphologically processed
LM, all trained from a corpus of a few hundred thousand words. We
also built a compact language model for the hand-held system, in
which singletons are eliminated and bigram and trigram counts are
pruned with increased thresholds. The resulting LM footprint is
10 MB.
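The count pruning used for the compact model can be illustrated with a minimal sketch. This is not the MASTOR training code: the function names, the toy corpus and the threshold values are assumptions made for illustration only.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams over tokenized sentences (no sentence padding, for brevity)."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def prune_counts(counts, min_count):
    """Keep only n-grams observed at least min_count times."""
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Hypothetical thresholds: 2 removes singletons, 3 prunes trigrams harder.
MIN_BIGRAM, MIN_TRIGRAM = 2, 3

corpus = [
    "where does it hurt".split(),
    "where does it hurt now".split(),
    "where does it hurt most".split(),
    "does it hurt".split(),
]
bigrams = prune_counts(ngram_counts(corpus, 2), MIN_BIGRAM)
trigrams = prune_counts(ngram_counts(corpus, 3), MIN_TRIGRAM)
```

Raising the thresholds shrinks the stored n-gram tables at the cost of backing off to lower-order estimates more often, which is the trade-off a small-footprint model has to accept.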

There are two approaches for translation. The concept-based
approach uses natural language understanding (NLU) and natural
language generation models trained from an annotated corpus. The
other approach is a phrase-based finite-state transducer, which is
trained using an un-annotated parallel corpus.

A trainable, phrase-splicing and variable substitution TTS system
is adopted to synthesize speech from translated sentences; it has
a special ability to generate speech of mixed languages seamlessly
[9]. In addition, a small-footprint TTS is developed for the
handheld devices using embedded concatenative TTS technologies
[10].

Next, we will describe our approaches in automatic speech
recognition and machine translation in greater detail.


3.  AUTOMATIC  SPEECH  RECOGNITION 

A.  Acoustic  Models 

Acoustic models and the pronunciation dictionary greatly influence
ASR performance. In particular, creating an accurate pronunciation
dictionary poses a major challenge when changing the language.
Deriving pronunciations for resource-rich languages like English
or Mandarin is relatively straightforward using existing
dictionaries or letter-to-sound models. In certain languages such
as Arabic and Hebrew, the written form does not typically contain
short vowels, which a native speaker can infer from context.
Deriving automatic phonetic transcriptions for speech corpora is
thus difficult. This problem is even more apparent when
considering colloquial Arabic, mainly due to the large number of
irregular dialectal words.

One approach to overcome the absence of short vowels is to use
grapheme-based acoustic models. This leads to straightforward
construction of pronunciation lexicons and hence facilitates model
training and decoding. However, the same grapheme may lead to
different phonetic sounds depending on its context. This results
in less accurate acoustic models. For this reason we experimented
with two other approaches. The first is a full phonetic approach
which uses short vowels, and the second uses context-sensitive
graphemes for the letter "A" (Alif), where two different phonemes
are used for "A" depending on its position in the word.
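The difference between a plain grapheme lexicon and a context-sensitive one can be sketched as follows. The text above only states that the unit chosen for Alif depends on its position in the word; the specific rule used here (word-initial versus elsewhere), the unit names A_INIT and A_MID, and the Buckwalter-style transliterated example are assumptions for illustration.

```python
def grapheme_pronunciation(word):
    """Plain grapheme lexicon entry: one acoustic unit per letter."""
    return list(word)

def context_sensitive_pronunciation(word):
    """Grapheme entry in which Alif ("A") maps to one of two units.

    Assumption: the position rule is word-initial vs. elsewhere, and the
    unit names A_INIT / A_MID are hypothetical.
    """
    units = []
    for position, letter in enumerate(word):
        if letter == "A":
            units.append("A_INIT" if position == 0 else "A_MID")
        else:
            units.append(letter)
    return units

# Hypothetical transliterated word containing two Alifs.
plain = grapheme_pronunciation("AlbAb")
contextual = context_sensitive_pronunciation("AlbAb")
```

Both lexicons can be generated automatically from the word list, so the context-sensitive variant keeps the main practical advantage of the grapheme approach.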

Using phoneme-based pronunciations would require vowelization of
every word. To perform vowelization, we used a mix of dictionary
search and a statistical approach. The word is first searched in
an existing vowelized dictionary, and if not found it is passed to
the statistical vowelizer [11]. Due to the difficulties in
accurately vowelizing dialectal words, our experiments have not
shown any improvements using phoneme-based ASR compared to
grapheme-based ASR.

Speech recognition for both the laptop and hand-held systems is
based on the IBM ViaVoice engine. This highly robust and efficient
framework uses rank-based acoustic scores [12], which are derived
from tree-clustered context-dependent Gaussian models. These
acoustic scores, together with n-gram LM probabilities, are
incorporated into a stack-based search algorithm to yield the most
probable word sequence given the input speech.

The English acoustic models use an alphabet of 52 phones. Each
phone is modeled with a 3-state left-to-right hidden Markov model
(HMM). The system has approximately 3,500 context-dependent states
modeled using 42K Gaussian distributions and trained using
40-dimensional features. The context-dependent states are
generated using a decision-tree classifier. The colloquial Arabic
acoustic models use about 30 phones that essentially correspond to
graphemes in the Arabic alphabet. The colloquial Arabic HMM
structure is the same as that of the English model. The Arabic
acoustic models are also built using 40-dimensional features. The
compact model for the PDA has about 2K leaves and 28K Gaussian
distributions. The laptop version has over 3K leaves and 60K
Gaussians. All acoustic models are trained using discriminative
training [13].
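To give a feel for why the PDA model is smaller, the following back-of-the-envelope sketch estimates the storage needed for the Gaussian parameters alone. It assumes diagonal covariances and 4 bytes per parameter, neither of which is stated above, and it ignores mixture weights, decision trees and any quantization the real system may apply.

```python
def gaussian_bytes(num_gaussians, dim, bytes_per_param=4):
    """Rough storage for Gaussian means and variances.

    Assumptions (not from the paper): diagonal covariance, so each Gaussian
    stores one mean and one variance per dimension, at 4 bytes per value.
    """
    return num_gaussians * dim * 2 * bytes_per_param

pda_bytes = gaussian_bytes(28_000, 40)     # compact PDA model
laptop_bytes = gaussian_bytes(60_000, 40)  # laptop model

pda_mb = pda_bytes / 1e6
laptop_mb = laptop_bytes / 1e6
```

Under these assumptions the compact model needs roughly 9 MB for its Gaussians against roughly 19 MB for the laptop model, which matters on a device with 64 MB of memory in total.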

B.  Language  Modeling 

Language modeling (LM) of the probability of various word
sequences is crucial for high-performance ASR of free-style
open-ended conversational systems. Our approaches to building
statistical tri-gram LMs fall into three categories: 1) obtaining
additional training material automatically; 2) interpolating
domain-specific LMs with other LMs; 3) improving the robustness
and accuracy of distribution estimation with limited in-domain
resources. Automatic data collection and expansion is the most
straightforward way to obtain an effective LM, especially when
little in-domain data is available. For resource-rich languages
such as English and Chinese, we retrieve additional data from the
World Wide Web (WWW) to enhance our limited domain-specific data,
which yields significant improvement [6].

In Arabic, words can take prefixes and suffixes to generate new
words which are semantically related to the root form of the word
(stem). As a result, the vocabulary size in Arabic can become very
large even for specific domains. To alleviate this problem, we
built a language model on morphologically tokenized data by
applying morphological analysis and hence splitting some of the
words into prefix+stem+suffix, prefix+stem or stem+suffix forms.
We refer the reader to [14] to learn more about the morphological
tokenization algorithm. Morphological analysis reduced the
vocabulary size by about 30% without sacrificing coverage.
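The three output forms can be illustrated with a deliberately naive affix-stripping sketch. This is not the algorithm of [14]: the affix lists, the minimum stem length and the transliterated example words are all hypothetical.

```python
PREFIXES = ("Al", "w")          # hypothetical prefix list
SUFFIXES = ("hm", "hA", "At")   # hypothetical suffix list
MIN_STEM = 3                    # hypothetical guard against over-splitting

def tokenize(word):
    """Split a word into prefix+, stem and +suffix tokens where affixes match."""
    prefix = next(
        (p for p in PREFIXES
         if word.startswith(p) and len(word) - len(p) >= MIN_STEM),
        "",
    )
    rest = word[len(prefix):]
    suffix = next(
        (s for s in SUFFIXES
         if rest.endswith(s) and len(rest) - len(s) >= MIN_STEM),
        "",
    )
    stem = rest[:len(rest) - len(suffix)]
    tokens = []
    if prefix:
        tokens.append(prefix + "+")
    tokens.append(stem)
    if suffix:
        tokens.append("+" + suffix)
    return tokens

examples = {w: tokenize(w) for w in ("byt", "Albyt", "bythm", "AlbytAt")}
```

Because the four surface forms share a single stem token, an LM over the tokenized text needs fewer distinct vocabulary entries than one over full words as the number of surface forms grows.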

More specifically, in our MASTOR system, the English language
model has two components that are linearly interpolated. The first
one is built using in-domain data. The second component acts as a
background model and is built using a very large generic text
inventory that is domain independent. The language model counts
are also pruned to control the size of this background model. The
colloquial Arabic language model for our laptop system is composed
of three components that are linearly interpolated. The first one
is the basic word tri-gram model. The second one is a class-based
language model with 13 classes that cover English and Arabic
names, numbers, months, days, etc. The third one is the
morphological language model described above.
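Linear interpolation of component LMs amounts to a weighted sum of their probabilities for the same word in the same context. The sketch below assumes each component has already produced a word-level probability; the weights and probabilities are made-up values, not the ones used in MASTOR.

```python
def interpolate(component_probs, weights):
    """Linearly interpolate probabilities from several language models."""
    if len(component_probs) != len(weights):
        raise ValueError("one weight is needed per component")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(w * p for w, p in zip(weights, component_probs))

# Hypothetical weights for the word trigram, class-based and morphological LMs.
weights = (0.5, 0.3, 0.2)
# Hypothetical probabilities each component assigns to the same next word.
component_probs = (0.020, 0.050, 0.010)

p_word = interpolate(component_probs, weights)
```

In practice the weights are typically tuned on held-out in-domain data so that the combined model minimizes perplexity.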

4.  SPEECH  TRANSLATION 

A.  NLU/NLG-based  Speech  Translation 

One of the translation algorithms we proposed and applied in
MASTOR is the statistical translation method based on natural
language understanding (NLU) and natural language generation
(NLG). Statistical machine translation methods translate a
sentence W in the source language into a sentence A in the target
language by using a statistical model that estimates the
probability of A given W, i.e., p(A|W). Conventionally, p(A|W) is
optimized on a set of pairs of sentences that are translations of
one another. To alleviate the data sparseness problem that arises
when such parallel data are limited and, hence, enhance both the
accuracy and robustness of estimating p(A|W), we proposed a
statistical concept-based machine translation paradigm that
predicts A with not only W but also the underlying concepts
embedded in W and/or A. As a result, the optimal sentence A is
picked by first understanding the meaning of the source sentence
W.

Let C denote the concepts in the source language and S denote the
concepts in the target language. Our proposed statistical
concept-based algorithm selects the word sequence \hat{A} as

\[ \hat{A} = \arg\max_{A} p(A \mid W) = \arg\max_{A} \Big\{ \sum_{S,C} p(A \mid S, C, W) \, p(S \mid C, W) \, p(C \mid W) \Big\} \]


where the conditional probabilities p(C|W), p(S|C,W) and
p(A|S,C,W) are estimated by the Natural Language Understanding
(NLU), Natural Concept Generation (NCG) and Natural Word
Generation (NWG) procedures, respectively. The probability
distributions are estimated and optimized upon a pre-annotated
bilingual corpus. In our MASTOR system, p(C|W) is estimated by a
decision-tree based statistical semantic parser, and p(S|C,W) and
p(A|S,C,W) are estimated by maximizing the conditional entropy as
depicted in [2] and [7], respectively.
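The decision rule can be made concrete with a toy sketch that, for one fixed source sentence W, sums p(A|S,C,W) p(S|C,W) p(C|W) over the concept hypotheses and picks the highest-scoring target sentence. The probability tables and concept labels below are invented for illustration; in the real system these distributions come from the NLU, NCG and NWG models.

```python
from collections import defaultdict

def best_translation(p_c, p_s_given_c, p_a_given_sc):
    """Return (argmax_A, scores) with scores[A] = sum over S, C of
    p(A|S,C,W) * p(S|C,W) * p(C|W), for one fixed source sentence W."""
    scores = defaultdict(float)
    for c, pc in p_c.items():
        for s, ps in p_s_given_c.get(c, {}).items():
            for a, pa in p_a_given_sc.get((s, c), {}).items():
                scores[a] += pa * ps * pc
    return max(scores, key=scores.get), dict(scores)

# Hypothetical source-side concept hypotheses p(C|W).
p_c = {"ASK(pain_location)": 0.8, "ASK(pain_time)": 0.2}
# Hypothetical target-side concept structures p(S|C,W).
p_s_given_c = {
    "ASK(pain_location)": {"S1": 1.0},
    "ASK(pain_time)": {"S2": 1.0},
}
# Hypothetical word generation probabilities p(A|S,C,W).
p_a_given_sc = {
    ("S1", "ASK(pain_location)"): {"target_1": 0.9, "target_2": 0.1},
    ("S2", "ASK(pain_time)"): {"target_2": 1.0},
}

best, scores = best_translation(p_c, p_s_given_c, p_a_given_sc)
```

Note that target_2 collects probability mass from both concept hypotheses, which is exactly what the sum over S and C in the decision rule expresses.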

We  are  currently  developing  a  new  translation  method  that  unifies 
statistical  phrase-based  translation  models  and  the  above 
NLU/NLG  based  approach.  We  will  discuss  this  work  in  future 
publications. 


B. Fast and Memory Efficient Machine Translation Using SIPL

Another translation method we proposed in MASTOR is based on the
Weighted Finite-State Transducer (WFST). In particular, we
developed a novel phrase-based translation framework using WFSTs
that achieves both memory efficiency and fast speed, which makes
it suitable for real-time speech-to-speech translation on scalable
computational platforms. In the proposed framework [15], which we
refer to as Statistical Integrated Phrase Lattices (SIPLs), we
statically construct a single optimized WFST encoding the entire
translation model. In addition, we introduce a Viterbi decoder
that can combine the translation model and language model FSTs
with the input lattice efficiently, resulting in translation
speeds of up to thousands of words per second on a PC and in the
hundred-words-per-second range on a PDA device. This WFST-based
approach is well suited to devices with limited computation and
memory. We achieve this efficiency by using methods that allow us
to perform more composition and graph optimization offline (such
as the determinization of the phrase segmentation transducer P)
than in previous work, and by utilizing a specialized decoder
involving multilayer search.

During the offline training, we separate the entire translation
lattice H into two pieces: the language model L and the
translation model M:

\[ M = \mathrm{Min}(\mathrm{Min}(\mathrm{Det}(P) \circ T) \circ W) \]

where ∘ is the composition operator, Min denotes the minimization
operation, and Det denotes the determinization operation; P is the
phrase segmentation transducer, T is the phrase translation
transducer, and W is the phrase-to-word transducer. Due to the
determinizability of P, M can be computed offline using a moderate
amount of memory.

The translation problem can be framed as finding the best path in
the full search lattice given an input sentence/automaton I. To
address the problem of efficiently computing I ∘ M ∘ L, we have
developed a multilayer search algorithm.

Specifically, we have one layer for each of the input FSMs: I, L,
and M. At each layer, the search process is performed via a state
traversal procedure starting from the start state s_0 and
consuming an input word in each step in a left-to-right manner.

We represent each state s in the search space using the following
7-tuple: (s_I, s_M, s_L, c_M, c_L, h, s_prev), where s_I, s_M and
s_L record the current state in each input FSM; c_M and c_L record
the accumulated cost in M and L, respectively, in the best path up
to this point; h records the target word sequence labeling the
best path up to this point; and s_prev records the best previous
state.

To reduce the search space, two active search states are merged
whenever they have identical s_I, s_M and s_L values; the
remaining state components are inherited from the state with lower
cost. In addition, two pruning methods, histogram pruning and
threshold or beam pruning, are used to achieve the desired balance
between translation accuracy and speed.
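State merging and the two pruning methods can be sketched as below. This is an illustrative simplification, not the SIPL decoder: states are plain dictionaries, the back-pointer is omitted, states are compared on the sum of their two accumulated costs (an assumption), and the beam width and histogram limit are arbitrary.

```python
def total_cost(state):
    # Assumption: states are ranked by the sum of their two accumulated costs.
    return state["c_M"] + state["c_L"]

def recombine(states):
    """Merge states sharing (s_I, s_M, s_L), keeping the lower-cost one."""
    best = {}
    for state in states:
        key = (state["s_I"], state["s_M"], state["s_L"])
        if key not in best or total_cost(state) < total_cost(best[key]):
            best[key] = state
    return list(best.values())

def prune(states, beam, max_states):
    """Beam (threshold) pruning followed by histogram pruning."""
    if not states:
        return []
    best = min(total_cost(s) for s in states)
    survivors = [s for s in states if total_cost(s) <= best + beam]
    survivors.sort(key=total_cost)
    return survivors[:max_states]

def make_state(s_I, s_M, s_L, c_M, c_L, h):
    return {"s_I": s_I, "s_M": s_M, "s_L": s_L, "c_M": c_M, "c_L": c_L, "h": h}

# Hypothetical active states after consuming two input words.
active = [
    make_state(2, 1, 1, 1.0, 0.5, ("where", "hurts")),
    make_state(2, 1, 1, 0.7, 0.4, ("where", "does", "it", "hurt")),
    make_state(2, 2, 1, 2.0, 1.0, ("where", "pain")),
    make_state(2, 3, 2, 5.0, 4.0, ("what", "pain")),
]
merged = recombine(active)
kept = prune(merged, beam=5.0, max_states=2)
```

Beam pruning removes states that are far worse than the current best regardless of how many remain, while histogram pruning caps the number of active states; together they bound both search time and memory.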

So that the decoder can also run on PDA devices, which lack a
floating-point processor, the search algorithm is implemented
using fixed-point arithmetic.
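One common way to realize fixed-point search costs is to store negative log probabilities as scaled integers, so that accumulating a path cost needs only integer addition. The sketch below illustrates that idea; the number of fractional bits and the helper names are assumptions, and the actual representation used in the decoder is not described here.

```python
import math

SCALE_BITS = 10  # hypothetical: 10 fractional bits, i.e. a resolution of 1/1024

def to_fixed(cost):
    """Convert a floating-point cost to a scaled integer."""
    return int(round(cost * (1 << SCALE_BITS)))

def from_fixed(fixed_cost):
    """Convert a scaled integer cost back to floating point (for inspection)."""
    return fixed_cost / (1 << SCALE_BITS)

# Hypothetical arc probabilities along one path.
probs = [0.5, 0.25, 0.1]

# Offline: costs are converted once when the models are built.
fixed_arc_costs = [to_fixed(-math.log(p)) for p in probs]

# Online: the decoder only adds integers.
fixed_path_cost = sum(fixed_arc_costs)
float_path_cost = sum(-math.log(p) for p in probs)
error = abs(from_fixed(fixed_path_cost) - float_path_cost)
```

Each conversion introduces at most half a quantization step of error, so the accumulated error stays small relative to typical cost differences between competing paths.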


5.  CONCLUSION 

We described the framework of the IBM MASTOR system and the
various technologies used in building major components for
languages with different levels of data resources. The
technologies have proven successful in building real-time S2S
systems on both laptops and platforms with limited computational
resources for two language pairs, English-Mandarin Chinese and
English-Arabic dialect. In the latter case, we also developed
approaches which lead to very rapid (within 3-4 months)
development of systems using very limited language and domain
resources. We are working on improving spontaneous speech
recognition accuracy and on integrating the two translation
approaches more naturally.